TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect

Abstract

Pre-trained models have accomplished high performances with the introduction of Transformers like the Bidirectional Encoder Representations from Transformers, known as BERT. Nevertheless, most of these proposed models have been trained on the most represented languages (English, French, German, etc.), and few models target under-represented languages and dialects. This work introduces a feasibility study of pre-training a language model based on Transformers on the Tunisian dialect as an example of under-represented languages. The model is evaluated on a dialect identification task, a sentiment analysis task, and a reading comprehension question-answering task. Results demonstrate that, instead of using datasets from traditional sources (Wikipedia, articles, etc.), the use of noisy web crawled data is more convenient for such a dialect. Additionally, experiments show that a reasonably small-scale dataset leads to similar or better achievements than large-scale datasets, and that TunBERT reaches or enhances the state of the art in all three downstream tasks. The pre-trained model, named TunBERT, and the datasets used for the fine-tuning step are publicly released.

Keywords: Transformers; Language models; Under-represented languages; TunBERT; BERT

Similar resources

Automatic Speech Recognition for Tunisian Dialect

Speech recognition for under-resourced languages has been an active field of research during the past decade. The Tunisian Arabic dialect has been chosen as a typical example of an under-resourced Arabic dialect. We propose, in this paper, our first steps to build an automatic speech recognition system for Tunisian dialect. Several Acoustic Models have been trained using HMM-GMM and HMM-DNN ...

Morphological Analysis of Tunisian Dialect

In this paper, we address the problem of the morphological analysis of an Arabic dialect. We propose a method to adapt an Arabic morphological analyzer for the Tunisian dialect (TD). In order to do that, we create a lexicon for the TD. The creation of the lexicon is done in two steps. The first step consists in adapting a Modern Standard Arabic (MSA) lexicon. We adapted a list of MSA derivation...

Building Ontologies to Understand Spoken Tunisian Dialect

This paper presents a method to understand spoken Tunisian dialect based on lexical semantics. This method takes into account the specificity of the Tunisian dialect, which has no linguistic processing tools. The method is ontology-based, which allows exploiting ontological concepts for semantic annotation and ontological relations for speech interpretation. This combination increases the rat...

Automatic Detection of Transition Zones in Tunisian Dialect

This study is an extension of our previous research on the detection of transition zones based on multiresolution spectral analysis (MRS). In this paper we present the fourth step in the realization of an automatic system for Tunisian Dialect segmentation and analysis. The MRS is calculated over several Fast Fourier Transforms (FFT) of different lengths. It can provide a higher temporal accura...

A Generative Model for Multi-Dialect Representation

In the era of deep learning, several unsupervised models have been developed to capture the key features in unlabeled handwritten data. Popular among them is the Restricted Boltzmann Machine (RBM). However, due to the novelty of handwritten multi-dialect data, the RBM may fail to generate an efficient representation. In this paper we propose a generative model, the Mode Synthesizing Machine (M...

Journal

Journal title: Communications in Computer and Information Science

Year: 2022

ISSN: 1865-0937, 1865-0929

DOI: https://doi.org/10.1007/978-3-031-08277-1_23